ABSTRACT

In light of the widespread use of competency testing,
the authors consider it important to determine ways of
developing and using competency testing to ensure that it achieves
its full potential. The paper, in three parts, introduces a model for
the development and validation of competency tests, reviews several
methods for setting standards or competency levels, and makes
suggestions for future research and development. First, definitions
of competency testing, criterion-referenced tests, and standards are
provided. The twelve-step development and validation model introduced
incorporates: competency selection; test specification; writing and
editing test items; determining content validity; further editing;
test assembly; standard setting; test administration; collection of
reliability, validity, and norms data; preparation of user's and
technical manuals; and periodic collection of additional information. The
standard-setting models considered are continuum models, of which the
major assumption is that mastery is a continuously distributed
ability. These models are further subdivided, for descriptive and
comparative purposes, into judgmental, empirical, and combination
models. The characteristics of the nineteen models thus categorized
are then discussed. The development of guidelines for competency test
development and further work on the moral and technical issues
involved in standard setting are recommended.

The establishment of minimum competency testing programs in elementary
and secondary schools, and for many professions, has reached immense
proportions (or epidemic proportions, if you view the trend negatively).
For example, well over half (33 to be exact) of our states have passed
legislation requiring assessment of the "competence" of their elementary
and high school students (Pipho, 1978). Further, many of these states
require that students demonstrate at least a minimum level of performance on
a set of competencies in order to receive a high school graduation diploma.
Why are so many state legislatures mandating minimum competency testing?
It appears that it is to discourage schools from the practice of
promoting all students and awarding high school graduation diplomas based
on school attendance only. It is common for legislators and
parents to say that minimum requirements in the "basic skills"
must be set for students to graduate with a diploma which has some
meaning. Perhaps it is not surprising to observe that participating
states are approaching the task of establishing minimum competency
testing programs differently. Some states are emphasizing "life skills,"
others "school skills," and yet other states have incorporated both
types of skills into their competency testing programs. Also, the school
years in which testing is done vary from one state to the next. Finally,
there are variations in the ways competencies are identified and measured,
and standards set (Haney and Madaus, 1978).

The rapidity of change in school, district, and statewide testing 
programs and the demand for high quality tests has dictated that sub- 
stantial research and development work be undertaken. Included among the 
more important research and development topics are: identification and
definition of competencies, management of competency testing programs,
development and validation of competency tests, methods of determining
standards, and uses and interpretations of competency test scores
(Brickell, 1978).

Other speakers at this AERA Competency Testing Conference have 
considered the philosophy and assumptions of competency testing programs, 
as well as their potential (and in some cases, demonstrated) effects on 
student performance and school curricula. Our contribution to the
conference will be to consider some ways for developing and using competency
tests to ensure that competency testing programs achieve their full
potential, whatever that potential may be. Specifically, this paper was
prepared to accomplish three purposes:

1. To introduce a model for developing and validating competency 
tests. 



2. To provide a review of several promising methods of determining 
"standards" or "minimum performance levels." 

3. To offer several suggestions for future research and development. 

We will not debate the merits of competency testing in this paper.
Others are far more informed about the issues and capable of articulating
them to those who have an interest. Our work will begin at the point
where (1) a decision has been made to initiate a competency testing 
program, and (2) a set of competencies has been identified and tests to 
measure individual performance on the competencies are required. Three 
other points concerning our work should also be mentioned: 

1. Attention is focused on the use of competency tests for making
decisions about individuals. When groups of examinees are of
primary interest (as in program evaluation studies or many
statewide testing programs), approaches to competency test development
and test score usage are somewhat different. (For example,
individuals and test items can be sampled, i.e., matrix sampling
is used, and "standards" are set for group performance.)

2. Many of our examples will be from elementary and secondary school
settings, although most of the testing technology discussed applies
equally well to the development of competency tests in other
content areas.

3. We will focus on the construction of paper and pencil tests. 
Steps for constructing performance tests are basically the same 
but special attention must be given to topics such as the design 
and use of behavioral checklists, and inter-rater reliability. 

The remainder of the paper is divided into three sections:
Development and Validation of Competency Tests, Methods of Standard
Setting, and Suggestions for Future Research and Development.

Development and Validation of
Competency Tests



A Competency Test 

Perhaps we should begin with a definition of a competency test: 

A competency test is designed to determine an
examinee's level of performance relative to
each competency being measured. Each
competency is described by a well-defined behavior
domain.

The definition makes clear that the purpose of a competency test is to 
provide information about an individual examinee's level of performance 
on each competency which is measured by a test. There will be as many 
test scores as there are competencies measured by a test. Also,
competencies are clearly written so that there will be a high level of
agreement among users of the test about the content (behaviors) defining
the competency. This desirable goal can be accomplished through the use
of "domain specifications" (Popham, 1978a). This term will be described
in more detail later. There is one other point. There is nothing
inherent in the definition of a competency test which requires test scores
to be compared to "standards." In fact, the percentage scores (reported
by competency) provide excellent descriptive information about examinee
performance. Since it is common, however, to interpret examinee test
performance relative to standards (an examinee who scores equal to or
above a standard set at 70% [say] on the set of test items
included in a competency test is described
as a "master" or "competent"), it is necessary to introduce a new term,
"minimum competency testing."

A minimum competency test is designed to determine
whether an examinee has reached a prespecified level
of performance relative to each competency being
measured. The "prespecified level" or "standard" may
vary from one competency to the next. Also, each
competency is described by a well-defined behavior
domain.

A "standard" (sometimes it is called a "cut-off score" or a "minimum proficiency 
level") is a point on a test score scale which is used to separate
examinees into two categories, each reflecting a different level of proficiency
relative to the competency measured by the test under consideration. It
is common to assign labels such as "master" or "competent" to those persons
in the higher-scoring category and "non-master" or "incompetent" to those
persons in the lower-scoring category. Note that if a test measures more
than a single competency and if examinees are to be classified into
competency categories based on their performance on each set of items measuring
a competency, as is often the case, a standard is set for each competency
measured by the test. There will be as many competency decisions as 
there are competencies measured by the test. 

It is important at this point to separate three types of standards.
Consider the following statement:

School district A has set the following target:
It desires to have 85% or more of its students
in the second grade achieve 90% of the reading
objectives at a standard of performance equal
to or better than 80%.

Three types of standards are involved in the example:

1. The 80% standard is used to interpret examinee performance
on each of the objectives measured by a test.

2. The 90% standard is used to interpret examinee performance
across all of the objectives measured by a test.

3. The 85% standard is applied to the performance of second graders
on the set of objectives measured by a test.

Only the first use of standards will be of interest in this paper. 
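
The way the three standards nest can be sketched in a few lines of Python. This is only an illustration of the school district example: the function names and the data layout (a mapping from objective name to a list of 0/1 item scores) are assumptions made here and do not come from the paper.

```python
from typing import Dict, List

# Hypothetical layout: one dict per student, mapping an objective name
# to that student's 0/1 scores on the items measuring the objective.
Responses = Dict[str, List[int]]


def objective_mastered(item_scores: List[int], item_standard: float = 0.80) -> bool:
    """First standard: proportion correct on one objective meets the standard."""
    return sum(item_scores) / len(item_scores) >= item_standard


def student_meets_target(responses: Responses,
                         item_standard: float = 0.80,
                         objective_standard: float = 0.90) -> bool:
    """Second standard: the student masters a large enough share of objectives."""
    mastered = [objective_mastered(s, item_standard) for s in responses.values()]
    return sum(mastered) / len(mastered) >= objective_standard


def group_meets_target(students: List[Responses],
                       item_standard: float = 0.80,
                       objective_standard: float = 0.90,
                       group_standard: float = 0.85) -> bool:
    """Third standard: a large enough share of students meet the second standard."""
    met = [student_meets_target(s, item_standard, objective_standard)
           for s in students]
    return sum(met) / len(met) >= group_standard
```

Only `objective_mastered` corresponds to the use of standards treated in the rest of the paper; the other two functions are included to show where the 90% and 85% figures enter.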






From the definitions above, it is clear that minimal competency tests 
are a special type of competency test (tests where standards are introduced 
to interpret examinee performance) and as we shall see later, competency 
tests are a special type of criterion-referenced test (i.e., those tests
which are usually used in certification and licensing situations).

Finally, there is nothing inherent in the definition of competency 
testing (or minimum competency testing) which precludes the measurement of 
school skills (for example, arithmetic, spelling, and reading) or life skills 
(for example, balancing a check book, following directions, or answering 
a job advertisement). 

Competency Tests and Criterion-Referenced Tests 

The competency testing technology would be in an embryonic stage
were it not for the work done in developing a criterion-referenced
testing technology since the late 1960's. A competency test is simply
a particular kind of criterion-referenced test and therefore, like a
criterion-referenced test, it must be developed and used in ways
somewhat different from better-known norm-referenced tests. Glaser (1963)
and Popham and Husek (1969) introduced the notion of criterion-referenced
testing so that test score information of the type needed to make a
variety of individual and programmatic decisions would be available.
Norm-referenced tests are designed, principally, to facilitate the use 
of scores derived from the tests to make comparative statements about 
individuals. This is not the primary type of information required by 
individuals who implement competency-based testing programs. They 
require information about the level of individual performance relative 
to well-defined content domains (referred to as "domain specifications"). 




A considerable amount of progress has been made during the last
ten years toward the establishment of a practical and usable
criterion-referenced testing technology. The existence of this technology (see,
for example, Hambleton & Eignor, 1978; Hambleton, Swaminathan, Algina, &
Coulson, 1978; Millman, 1974; Popham, 1978a) makes it possible, among
other things, to develop criterion-referenced tests for use in diagnosing
student learning deficiencies, monitoring student progress, and evaluating
school programs. The same basic technology is useful also for individuals
who must develop and validate minimum competency tests for (say) high
school graduation, although matters such as the selection of competencies
for inclusion in a test and approaches for developing and validating
tests will be handled somewhat differently.

At what stage of development is a competency testing technology? 
There would be considerable agreement among measurement specialists on 
the statements offered below: 

1. Definitional problems have been sorted out (for example, distinc- 
tions among norm-referenced, criterion-referenced, competency-based, 
domain-referenced, and objectives-referenced tests are clear). 

2. The need for "domain specifications" is clear and adequate methods 
for developing them do exist. 

3. There is at least an adequate technology available for developing 
and validating competency tests. 

4. The problem of test score reliability has been articulated clearly 
and approaches now exist for determining reliability of scores 
for various intended uses. 

5. Methods for using and reporting competency test score information
are available.

The interested reader is referred to Hambleton et al. (1978) and
Popham (1978a) for further discussion of the points above.

Of course, there remains a considerable amount of work to be
done. The four topics below are especially important:

1. Improved guidelines for preparing domain specifications, 

2. Guidelines for evaluating competency tests and test manuals, 

3. Research on the relationships among test length, test 
score reliability and test score validity, 

4. Further consideration of issues and methods of determining
standards, and of guidelines for implementing each of
the methods.

How should a competency test be developed and validated? This problem 
is addressed in the next section of the paper. 

Steps in Test Development and Validation 

A twelve step model for developing and validating competency 
tests is presented in Figure 1. The importance of each step in the 
model depends upon the size and scope of the test development and 
validation project. An agency with the responsibility of producing 
a state-wide competency test will proceed through the steps in a rather 
different way from a small consulting firm or a school district. 

In brief, the twelve steps are as follows:

Step 1-- Competencies must be prepared or selected before the
test development process can begin.

Step 2-- Test specifications are needed to clarify the test's
purposes, desirable item formats, number of test items,
instructions to item writers, etc.

1. Preparation and/or Selection of Competencies

2. Preparation of Test Specifications (for example, Specification
of Item Formats, Appropriate Vocabulary, and Number of Test
Items/Competency)

3. Writing Test Items "Matched" to Competencies

4. Editing Test Items

5. Determining Content Validity of the Test Items

a. Involvement of Content Specialists

b. Collection of Student Response Data

6. Additional Editing of Test Items

7. Test Assembly

a. Determination of Test Length

b. Test Item Selection

c. Preparation of Directions

d. Layout and Test Booklet Preparation

e. Preparation of Scoring Keys

f. Preparation of Answer Sheets

8. Setting Standards for Interpreting Examinee Performance

9. Test Administrations

10. Collection of Reliability, Validity and Norms Information

11. Preparation of a User's Manual and a Technical Manual

12. Periodic Collection of Additional Technical Information

Figure 1. Steps for Developing and Validating Competency Tests.

Step 3-- Items are prepared to measure competencies included in the
test (or tests, if there are going to be parallel forms
or levels of a test varying in difficulty).

Step 4-- Initial editing of items is completed by the individuals
writing them.

Step 5-- A systematic assessment of items prepared in steps 3 and
4 is conducted to determine item validities.
Essentially, the task is to determine the content validity
of the test items.

Step 6-- Based on the data from step 5, it is possible to do
further item editing, and in some instances, discard
items that do not at least adequately measure the
competencies they were written to measure.

Step 7-- The test (or tests) can be assembled.

Step 8-- A method for setting standards to interpret examinee performance is
selected and implemented.

Step 9-- The test (or tests) can be administered.

Step 10-- Data addressing reliability, validity, and norms can be
collected and analyzed.

Step 11-- A user's manual and a technical manual should be
prepared.

Step 12-- This step is included to reinforce the point that it is
necessary, in an on-going way, to be compiling technical
data on the test items and tests as they are used in
different situations with different examinee populations.

Whether a competency test or a minimum competency test is being developed,
steps one to six will be the same. At step seven, it is possible
(although not essential) that different methods will be used to select test
items. Step eight is unique to minimum competency testing. Remaining
steps in the model (steps 9 to 12) are essentially the same for the two
types of tests. About the only differences are those concerning approaches to
validating test scores. Clearly, since the two types of tests are
intended to accomplish different purposes, approaches for validating test
scores will, in general, be different.

Four of the steps (1, 3, 5, and 7) in developing a competency test
will be discussed next. Useful references for an expanded discussion
of the other steps are Hambleton and Eignor (1978); Hambleton, Swaminathan,
Algina, and Coulson (1978); Millman (1974); and Popham (1978a).

1. Statement of Competencies.-- It is popular to write competencies in
"behavioral terms." However, while behavioral statements have some
desirable features (for example, they are relatively easy to produce),
they often lack the clarity necessary to permit a clear determination
of the domain of test items measuring the behaviors defined by a
competency. If the proper domain of test items measuring a competency
is not clear, the task of preparing valid test items is more difficult.
Also, it is impossible to select a representative sample of test items
from that domain if the domain is not clearly specified. Since it is often
desired to interpret examinee performance on a sample of test items
measuring a particular competency as an estimate of that examinee's level
of performance in the larger domain of items, it is essential to have the
domain of test items specified clearly, and to choose a representative
sample of test items.

Domain specifications are an important new development in
competency testing (Popham, 1978b). Domain specifications clarify the intended
content specified by a competency. Such information is invaluable to
teachers (they must teach the competencies defined in the domain
specifications), to parents (they often wish to have information about the
competencies), and to item writers (they must produce "valid" test items,
i.e., test items that are representative of the domain of items measuring
each competency). There are at least four steps outlined by Popham for
the development of domain specifications. The first involves the
preparation of a general description. The general description could be a
behavioral objective, a detailed description of the competency, or a
short cryptic descriptor. Next, a sample test item is prepared. This
will help to clarify the domain of test items and to specify item format.
The third step is perhaps the most difficult. It is necessary to
indicate the content included in the domain. In the final step,
characteristics of response alternatives or response limits are specified. An
example of a domain specification is shown in Figure 2.

SKILL: The student will identify the tone or emotion expressed in a
paragraph.

SAMPLE ITEM:

Directions: Read the paragraph. Underline the best word to
complete the sentence.

Jimmy had been playing at the beach all day.
It was time to go home. Jimmy sat down in the
back seat of the car. He could hardly keep his
eyes open.

Jimmy felt ________.

A. afraid B. friendly C. tired D. kind

CONTENT:

1. The paragraph will contain situations which are familiar
to the students being tested.

2. The paragraph will contain no less than three and no more than
six sentences. The readability level will be no higher than
Second Reader.

3. The emotions expressed will be from the following list:

sad mad angry

tired scared friendly

happy lucky smart

kind excited proud

RESPONSE MODE:

1. Responses will be one word in length.

2. The items will contain one correct and three incorrect responses.

3. Distractors are to be words describing a feeling and may be
taken from the list above.

4. Avoid having distractors as possible answers (i.e., in the
sample item, "mad" would not be a good choice for a distractor.
Jimmy could feel mad about leaving the beach.)

Figure 2. An example of a domain specification from the
reading area. (The authors are grateful to
Marlene Teichert for the example.)
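
Some of the limits in a domain specification like the one in Figure 2 are mechanical enough to be checked automatically. The sketch below is one hypothetical way to do that in Python; the `Item` class, the `check_item` function, and the sentence-splitting rule are assumptions introduced here for illustration. Rules that call for judgment (familiarity of the situation, readability level, whether a distractor is a plausible answer) are deliberately left to content specialists.

```python
import re
from dataclasses import dataclass
from typing import List

# Emotion list from the CONTENT section of the domain specification.
EMOTIONS = {"sad", "mad", "angry", "tired", "scared", "friendly",
            "happy", "lucky", "smart", "kind", "excited", "proud"}


@dataclass
class Item:
    """Hypothetical container for one multiple-choice test item."""
    paragraph: str
    options: List[str]
    answer: str


def check_item(item: Item) -> List[str]:
    """Return rule violations; an empty list means these checks passed."""
    problems = []
    # Crude sentence count: split on runs of ., ! or ? and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", item.paragraph) if s.strip()]
    if not 3 <= len(sentences) <= 6:
        problems.append("paragraph must contain three to six sentences")
    if len(item.options) != 4 or item.options.count(item.answer) != 1:
        problems.append("item needs one correct and three incorrect responses")
    if any(len(option.split()) != 1 for option in item.options):
        problems.append("each response must be one word in length")
    if item.answer not in EMOTIONS:
        problems.append("keyed emotion must come from the specified list")
    return problems
```

Applied to the sample item in Figure 2 (answer "tired"), `check_item` returns an empty list. Passing these checks does not establish content validity; it only screens out items that break the stated format limits.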

The important aspect of implementing the steps is that they lead to
specified item domains; it is not necessary, however, that homogeneous
content domains be produced. Specificity and homogeneity
are different concepts. Millman (1974) makes this point, "The domain
being referenced by a [criterion]-referenced test may be extensive or
a single, narrow objective, but it must be well defined, which means
that content and format limits must be well specified" (p. 314).

3. Generation of Test Items.-- Once domain specifications are defined,
the test constructor must generate test items. If the domains are
defined in a perfectly precise manner, then the items themselves would not
need to be generated. The items would simply be a logical consequence
of the domain definitions (for example, see Hively, Patterson, & Page, 1968).
Unfortunately, however, such precision will seldom be achieved in practice
and so test items must be produced and procedures, like those described
in step five, used to check the adequacy of the test items.

Principles of item writing used in norm-referenced achievement 
test construction apply to competency tests as well. It is 
necessary though, for item writers to attend closely to the domain speci- 
fications. Test items should be written to "tap" behaviors in the domain 
of behaviors defined by the domain specifications. After editing of the
test items, the next step is to determine the item validities.

5. Determination of Content Validity.-- Generally speaking, the
quality of competency test items can be determined by the extent to which
they reflect, in terms of their content, the domains from which they were derived.
The problem here is one of item validation; unless one can say with a high
degree of confidence that the items in a competency test measure the
intended competencies, any use of the test score information is
questionable. When domain specifications are utilized, the domain definition
is never really precise enough to assume a priori that the items are
valid. Thus the quality of the items must be determined in a context
independent from the process by which the items were generated. This
is an a posteriori approach to item validation. Some procedures
have been designed to assess whether or not a direct relationship between
an item and a domain or objective exists through analysis of data
collected after the item is written (Hambleton & Eignor, 1978; Hambleton &
Fitzpatrick, in preparation; Popham, 1978a).

There are two approaches which may be used to establish the
(content) validity of test items. The first approach, and the approach
we feel holds the most merit, involves the judgment of test items by
content specialists. The judgments that are made concern the extent of
"match" between the test items and the domain they are designed to
measure. Questions asked of content specialists about content validity
of test items can be reduced to two important ones:

1. Is the format and content of an item appropriate to measure
some part of the domain specification?

2. Does the available set of test items adequately sample a
particular domain?

A second approach is to apply empirical techniques to examinee response
data in much the same way empirical techniques are applied in norm-referenced
test development. In fact, along with some recently developed
empirical procedures for competency tests, several norm-referenced test item
statistics can (and should) be used. The problem is to ensure that
these statistics are used and interpreted correctly in the context of
competency test development. Item statistics should be used to detect
aberrant items that need to be reworked, and not to make final decisions about
which items are to be included in a competency test. An excellent review
of item statistics for use with competency tests has been prepared by
Berk (1978).

7. Test Assembly . — The length of a competency test (or more 
importantly, the number of test items measuring each competency in a 
test) is directly related to the usefulness of the test scores obtained 
from the test. Short tests typically produce imprecise competency score 
estimates, and lead to competency decisions which prove to be
inconsistent across parallel-form administrations (or retest administrations).
(An examinee competency score is the proportion of items in the pool of items 
defined by a domain specification that the examinee can answer correctly. 
A competency score estimate is obtained by administering a sample of items 
to the examinee and calculating his/her proportion-correct score.) 

Three factors should be considered in making decisions about the
number of items:

1. the relationship between the number of test items and the importance
placed upon the particular competency,

2. the relationship between the number of test items and the minimum
acceptable level of test score reliability,

3. the relationship between the number of items and available testing time.

In terms of factor one, it may be the case that some competencies
are more important relative to the goals of the competency testing program
than others. If the test developer plans for the test to cover multiple
competencies, he/she should then plan, when drawing samples of items from
each domain of items "keyed" to a competency, to more heavily sample the
most important competencies.

In reference to factor two, the relationship of the number of test
items to minimum reliability requirements, guidelines are not readily
available. The Spearman-Brown formula, which relates test length to
reliability, is reasonable to use only with norm-referenced tests.
Similar relationships need to be developed for competency tests. The following
procedure should be helpful to those determining test length when competency
score estimation is the problem of interest. The solution is a
conservative one, i.e., test lengths determined by this method will be a
little longer than they need to be to obtain the degree of precision
required by the test developer. The formula¹ is:

\[
\text{Test Length} = \frac{.25}{(\text{degree of precision})^{2}}
\]

Ask yourself (or interested others): What degree of precision is
required of the competency score estimates? Discuss the degree of
precision question in the same way you would the standard error
of measurement. A primary difference between the two is that
competency score estimates are defined on a scale (0, 1).
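
One reading of the test length rule, consistent with the footnote's reference to the binomial test model, runs as follows. Under that model the standard error of a proportion-correct score on n items is the square root of π(1 − π)/n, where π is the examinee's competency score. Since π(1 − π) can never exceed .25, requiring the square root of .25/n to be no larger than the desired precision gives n = .25/(precision)², which is conservative for any π other than .5. The Python sketch below works under that assumption; the function names are introduced here for illustration only.

```python
import math


def binomial_standard_error(competency_score: float, n_items: int) -> float:
    """Standard error of a proportion-correct score under the binomial model."""
    return math.sqrt(competency_score * (1.0 - competency_score) / n_items)


def conservative_test_length(precision: float) -> int:
    """Smallest whole number of items with worst-case standard error <= precision.

    Assumes the worst case pi * (1 - pi) = .25, reached at pi = .5.
    """
    if not 0.0 < precision < 1.0:
        raise ValueError("precision must lie strictly between 0 and 1")
    # Rounding guards against floating-point noise just above a whole number.
    return math.ceil(round(0.25 / precision ** 2, 9))


if __name__ == "__main__":
    for precision in (0.05, 0.10, 0.20):
        print(precision, conservative_test_length(precision))
```

For example, a precision of .10 calls for 25 items per competency and a precision of .05 for 100 items, which shows how quickly the requirement grows as the tolerance is tightened.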

¹The formula can be derived from the binomial test model.

At present we are working on tables relating test length to
reliability when the test is used for making competent/incompetent decisions about
examinees. The research is just beginning; thus, we are unable to report
any results at this time. However, two points can be made. One, it
is unlikely that fewer than five or six items measuring a competency
will produce desired levels of reliability. Two, while no tables or
formulas exist to connect test length to reliability (or consistency)
of decision-making, reliability can be studied empirically after the
administration of a pool of test items to a group of examinees (step 5b).
"Post-hoc" test forms of varying lengths can be constructed and reliability
estimates may be calculated, on the assumption that examinees would
have responded in the same way had they been presented with the
"parallel-forms" rather than a single large pool of test items. By varying the
length of the forms and the formation of parallel-forms (i.e., which
items are placed in which forms), the relationship between test length
and reliability for a specified sample of examinees for a pool of test
items measuring a particular domain specification can be studied.
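
The post-hoc procedure can be sketched in Python. The version below is a minimal illustration under stated assumptions, not the authors' procedure: it measures consistency as the proportion of examinees given the same competent/incompetent classification on two disjoint, randomly formed forms, and averages that proportion over many random splits. The function name, the choice of index, and the default number of splits are all hypothetical.

```python
import random
from typing import Sequence


def decision_consistency(responses: Sequence[Sequence[int]],
                         form_length: int,
                         standard: float,
                         n_splits: int = 200,
                         seed: int = 0) -> float:
    """Average agreement of mastery decisions across random post-hoc form pairs.

    responses[e][i] is 1 if examinee e answered pool item i correctly, else 0.
    """
    n_items = len(responses[0])
    if 2 * form_length > n_items:
        raise ValueError("pool is too small for two disjoint forms of this length")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_splits):
        # Draw two disjoint sets of items to act as parallel forms.
        picked = rng.sample(range(n_items), 2 * form_length)
        form_a, form_b = picked[:form_length], picked[form_length:]
        same = 0
        for row in responses:
            master_a = sum(row[i] for i in form_a) / form_length >= standard
            master_b = sum(row[i] for i in form_b) / form_length >= standard
            same += master_a == master_b
        total += same / len(responses)
    return total / n_splits
```

Calling the function with several values of `form_length` on the same response data traces out the relationship between test length and decision consistency for that pool and that sample of examinees.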

The item selection process is straightforward provided the
competency test developer has been careful in defining competencies
and in constructing test items. That is, the test developer
has to have been careful to define the size of his/her domain to be
consonant with the test's purpose. If the purpose of testing is to
make decisions on, for instance, broad school competencies, large
domain sizes can be tolerated. If, however, the purpose of testing is
to provide information for remedial instruction, a smaller domain size
is needed. Popham (1978b) has offered some suggestions for ascertaining
domain size. The critical point for item selection is that the domain
be of a reasonable size so that proper sampling from the domain can
occur. If the domain is so large that it is difficult to see how
to generate a set of items from the domain for the test, then the
domain must be broken up into sub-domains and items generated for
those sub-domains. The sampling process should be clear for these
sub-domains. Thus, it is critical that the domain be of a size that
a set of items can be clearly constructed from the domain, and then
the sampling process can be carried out without complications.

Having defined a domain size that is manageable for sampling is
not enough; the test developer must also be careful to ascertain that
all the items constructed for the domain do indeed "tap" the behavior
specified. The items must adhere to the restrictions imposed by the
domain specifications.

If the size of the domain is manageable for the sampling process
and the test developer is sure that the items generated "tap" the
specified behaviors, then the item selection process is straightforward. The
test is constructed by taking either a random or stratified random
sample of items from the domain.
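
As a small illustration, a random or stratified random draw from an item pool might look like the Python sketch below. The function name, the idea of labeling strata (for example, sub-domains of a competency), and the item identifiers are assumptions made for the example; a simple random sample is the special case of a single stratum.

```python
import random
from typing import Dict, List, Optional


def draw_items(pool: Dict[str, List[str]],
               per_stratum: Dict[str, int],
               seed: Optional[int] = None) -> List[str]:
    """Draw a stratified random sample of item identifiers.

    pool maps a stratum label to the item identifiers written for it;
    per_stratum gives how many items to draw from each stratum.
    """
    rng = random.Random(seed)
    chosen: List[str] = []
    for stratum, item_ids in pool.items():
        # random.sample draws without replacement within the stratum.
        chosen.extend(rng.sample(item_ids, per_stratum[stratum]))
    return chosen
```

Drawing without replacement within each stratum keeps the form representative of the domain while guaranteeing that every sub-domain contributes its planned number of items.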

One advantage of choosing representative sets of test items is
that examinee test scores (or proportion-correct scores) provide
"unbiased" estimates of their "true" competency scores. It is possible
also to set standards and interpret examinee test performance relative
to these standards. Unfortunately, when the number of test items is
small (as is frequently the case), the consistency of decisions
(competent/incompetent) across a retest administration or across a
parallel-form administration of a test may be distressingly low.
Increasing the number of test items measuring each competency is
helpful, but often it is not feasible to do so. One answer to this
dilemma is as follows: When the primary purpose of the testing program
is to make dichotomous decisions about examinees, a more effective test
can be produced if test items from the available pool of test items
measuring each competency are selected based on their statistical
properties. Specifically, if (say) a standard is set at 80%, it would
be best to select test items which have p-values (item difficulty
levels) in the region of .80 and which have the highest discrimination
indices. A test constructed in this way will have maximum
discriminating power in the region where decisions are being made, and
therefore more reliable and valid decisions will result. One possible
drawback is that scores derived from the test cannot be used to make
descriptive statements about examinee levels of performance on the
competencies measured in the test. This is because test items measuring
each competency are not necessarily a representative sample. In theory,
there is at least one way to make descriptive statements about examinee
levels of performance on the competencies measured by a test when
non-random or non-representative samples of test items are chosen. It
can be done by introducing concepts and models from the field of latent
trait theory. The feasibility, however, of such an approach has not
been tested.
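
The statistical selection rule described above can be sketched as follows. This is only an illustration: the item pool, the width of the "region" around the standard (`window`), and the function name `select_items` are assumptions, not values given in the paper.

```python
def select_items(pool, cutoff, n_items, window=0.10):
    """Pick items with p-values near the cutoff, most discriminating first.

    pool: list of dicts with 'p' (item difficulty) and 'disc'
    (discrimination index). 'window' is an assumed tolerance around
    the cutoff; the paper says only "in the region of".
    """
    near = [it for it in pool if abs(it["p"] - cutoff) <= window]
    near.sort(key=lambda it: it["disc"], reverse=True)
    return near[:n_items]

# Hypothetical pool of items measuring one competency.
pool = [
    {"id": "A", "p": 0.82, "disc": 0.45},
    {"id": "B", "p": 0.55, "disc": 0.60},
    {"id": "C", "p": 0.78, "disc": 0.30},
    {"id": "D", "p": 0.85, "disc": 0.52},
    {"id": "E", "p": 0.95, "disc": 0.20},
]
chosen = select_items(pool, cutoff=0.80, n_items=2)
print([it["id"] for it in chosen])  # ['D', 'A']
```

Item B has the highest discrimination index in the pool but is excluded because its difficulty lies far from the standard, which is the trade-off the paragraph describes.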

Methods of Standard-Setting 
Numerous researchers have catalogued many of the available standard
setting methods (Glass, 1978a; Hambleton & Eignor, 1978; Hambleton et al.,
1978; Jaeger, 1976; Millman, 1973; Meskauskas, 1976; Popham, 1978b;
Shepard, 1976). If one fact is clear, it is that all standard setting
methods are arbitrary, and this point has been acknowledged by nearly
every contributor to the area. All of the methods are arbitrary because
they involve judgments of one kind or another (for example, raters may
be asked to identify test items which a minimally competent examinee
should be able to answer) and choices (for example, a choice of
standard-setting methods must be made). But the "arbitrariness" of
standard-setting methods is not a satisfactory reason for rejecting the
methods. A quote from Popham (1978a) is especially appropriate here:

Unable to avoid reliance on human judgment as 
the chief ingredient in standard-setting, some 
individuals have thrown up their hands in dismay 
and cast aside all efforts to set performance 
standards as arbitrary, hence unacceptable.

But Webster's Dictionary offers us two defi-
nitions of arbitrary. The first of these is
positive, describing arbitrary as an adjective
reflecting choice or discretion, that is, "deter-
minable by a judge or tribunal." The second
definition, pejorative in nature, describes 
arbitrary as an adjective denoting capricious- 
ness, that is, "selected at random and without 
reason." In my estimate, when people start 
knocking the standard-setting game as arbitrary, 
they are clearly employing Webster's second, 
negatively loaded definition. 

But the first definition is more accurately 
reflective of serious standard-setting efforts. 
They represent genuine attempts to do a good job 
in deciding what kinds of standards we ought to 
employ. That they are judgmental is inescapable. 
But to malign all judgmental operations as capri-
cious is absurd. (p. 168)



In a recent review of the standard-setting literature, Hambleton
and Eignor (1978) discussed six different sets of methods for setting
standards. This review was an expansion of some earlier work by
Millman (1973) and Meskauskas (1976). A discussion of the same sort,
adding some standard-setting methods recently advanced (i.e., Jaeger,
1978; Zieky and Livingston, 1977), would perhaps prove helpful, if only
to identify the more than twenty methods advanced to date. Such a
discussion of methods will not be presented here, however, because a
large number of these methods do not appear to be useful for setting
standards in minimum competency testing programs. Those methods that
appear to us to be applicable will be discussed in some detail. Also,
a number of comparisons will be made in this "sifting out" of relevant
methods, the first being the useful distinction made by Meskauskas
(1976) between continuum and state models.



Continuum and State Models 

The basic difference between continuum and state models has to do
with the underlying assumption made about ability. According to Meskauskas,
two characteristics of continuum models are:

1. Mastery is viewed as a continuously distributed ability or set
of abilities.

2. An area is identified at the upper end of this continuum, and
if an individual equals or exceeds the lower bound of this
area, he/she is termed a master.

State models, rather than being based on a continuum of mastery, view
mastery as an all-or-none proposition (i.e., either you can do some-
thing or you cannot). Three characteristics of state models are:

1. Test true-score performance is viewed as an all-or-nothing
state.

2. The standard is set at 100%.

3. After a consideration of measurement errors, standards are
often set at values less than 100%.

There are at least three methods for setting standards that are
built on a state model conceptualization of mastery. The models take
into account measurement error, deficiencies of the examination, etc.,
in "tempering" the standard from 100%. These methods have been referred
to by Glass (1978a) in his review of methods for setting standards as
"counting backwards from 100%." State model methods advanced to date
include the mastery testing evaluation model of Emrick (1971), the
true-score model of Roudabush (1974), and some recently advanced statis-
tical models of Macready and Dayton (1977). However, since state
models are somewhat less useful than continuum models in elementary
and secondary school minimum competency testing programs, they will not
be considered further in this paper. Our failure to consider them fur-
ther in this paper, however, should not be interpreted as a criticism
of this general approach to standard-setting. The approach seems to
be especially applicable with many performance tests.

Traditional and Normative Procedures

Before discussing further the various continuum models of standard
setting, two other models for standard-setting should be mentioned.
These methods, which seem to have limited value in setting minimum
competency standards, have been referred to by a variety of names.
We will call them "traditional standards" and "normative standards."

Traditional standards are standards that have gained acceptance
because of their frequent use. Classroom examples include the familiar
scale on which 90 to 100 percent is an A, 80 to 89 percent is a B, etc.
It appears that from time to time such methods have been used in
setting standards for minimum competency tests.

"Normative" standards refer to any of three different uses of
normative data, two of which are, at best, questionable. In the first
method, use is made of the normative performance of some external
"criterion" group. As an example, Jaeger (1978) cites the use of the
Adult Performance Level (APL) tests by Palm Beach, Florida schools.
Test performance of groups of "successful" adults was used to set
competency standards for high school students. The notion is that
the test performance of "successful" adults provides a basis for
setting standards for high school students. Such a procedure can be
criticised on a number of grounds. Jaeger (1978) points out that
society changes, and that standards should also change. Standards
based on adult performance may not be relevant to high school students.
Shepard (1976) points out that any normatively-determined standard will
immediately result in a multitude of counterexamples. Further, Burton
(1978) points out that relationships between skills in school subjects
and later success in life are not readily determinable; hence, observing
the degree of achievement on the test of some "successful" norm group
makes little sense. Jaeger (1978) goes on to say: "There
are no empirically tenable 'survival' standards on school-based skills
that can be justified through external means."

A second way of proceeding with normative data is to make a
decision about a standard based solely on the distribution of scores
of examinees who take the test. Such a procedure circumvents the
"minimum test score for success in life" problem, but the procedure
is still not useful for setting standards. For instance, Glass (1978a)
cites the California High School Proficiency Examination, where the 50th
percentile of graduating seniors served as the standard. What can
be said of a procedure where whether or not an individual passes or
fails a minimum competency test depends upon the other individuals
taking the test? In the California situation, the standard was set
with no reference at all to the content of the test or the difficulty
of the test items.
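
The dependence on "the other individuals taking the test" can be seen in a small sketch. The two cohorts and the examinee's score below are invented purely for illustration; only the idea of using the 50th percentile as the standard comes from the text.

```python
import statistics

def median_standard(scores):
    """Normative standard set at the 50th percentile of the group tested."""
    return statistics.median(scores)

# Two hypothetical reference groups taking the same test.
strong_cohort = [70, 75, 80, 85, 90]
weak_cohort = [40, 45, 50, 55, 60]

examinee_score = 65  # the same performance in both cases
passes_with_strong = examinee_score >= median_standard(strong_cohort)
passes_with_weak = examinee_score >= median_standard(weak_cohort)
print(passes_with_strong, passes_with_weak)  # False True
```

The identical score fails against one group and passes against the other, with no reference to what the items actually measure.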

The third use of normative data discussed in the literature
concerns the supplemental use of normative data in setting a standard.
Shepard (1976), Jaeger (1978), and Conaway (1976, 1977) all favor such
a procedure. Recently Jaeger (1978) advanced a standard setting
method which requires judges to make judgments on item content. In
his method, Jaeger calls for incorporation of some tryout test data
to aid judges in reconsidering their initial assessments. Shepard
(1976) makes the following point:

Expert judges ought to be provided with normative
data in their deliberations. Instead of relying
on their experience, which may have been with un-
usual students or professionals, experts ought to
have access to representative norms. . . of course,
the norms are not automatically the standards. Ex-
perts still have to decide what "ought" to be, but
they can establish more reasonable expectations
if they know what current performance is than if
they deliberate in a vacuum.

We agree with Jaeger, Conaway, and Shepard about the usefulness
of normative data when used in conjunction with a standard setting
method.

Consideration of Several Promising
Standard Setting Methods

Other methods for setting standards to be discussed in this
paper are either built on a continuum model of ability or some other
unexpressed model. For convenience, the methods under discussion
were organised into three categories or models. These models and
methods are presented in Figure 3. The models are labelled
"judgmental," "empirical," and "combination." By judgmental is meant
that data are collected from judges for setting standards, or a judgment
is made about the presence or lack of a variable (for instance, guessing)
that would affect the standard. Empirical methods require the
collection of examinee response data to aid in the standard-setting
process.

Figure 3. A classification of models and methods for setting standards.

Judgmental Models
    Item Content*
        Nedelsky (1954)
        Modified Nedelsky (Nassif, 1978)
        Angoff (1971)
        Modified Angoff (ETS, 1976)
        Ebel (1972)
        Jaeger (1978)
    Guessing
        Millman (1973)

Combination Models
    Judgmental-Empirical
        Contrasting Groups (Zieky and Livingston, 1977)
        Borderline Groups (Zieky and Livingston, 1977)
    Educational Consequences
        Block (1972)
    Bayesian Methods
        Hambleton and Novick (1973)
        Novick, Lewis, Jackson (1973)
        Schoon, Gullion, Ferrara (1978)

Empirical Models
    Data: Two Groups
        Berk (1976)
    Data: Criterion Measure
        Livingston (1975)
        Livingston (1976)
        Huynh (1976)
        Van der Linden and Mellenbergh (1977)
    Decision-Theoretic
        Kriewall (1972)

Notes to Figure 3 (partly illegible in the source):
... involve the use of examinee response data.
These are also applicable to cut-off score determination ...

Empirical Methods

A number of methods have been developed that require a criterion
measure, performance measure, or true ability continuum. Livingston
(1975) has presented a procedure based on linear or semi-linear utility
functions in which he looks at the use of these functions in viewing
the effects of decision-making accuracy based upon a particular per-
formance standard. Livingston (1976) presented a method for choosing
standards by stochastic approximation techniques. Once again, the
procedure depends upon a performance measure, and a standard set on
that measure. Huynh (1976) bases a standard-setting method
for a competency test on an external criterion.

Finally, the work of Van der Linden and Mellenbergh (1977) depends upon
the existence of a latent ability variable that can be dichotomized
into two categories, labeled "competent" and "incompetent." The
standard is then set based upon a risk or expected loss function.

These methods have only been briefly mentioned because they all
are difficult to apply in practice, since they require a criterion vari-
able upon which success and failure (or probability of success and fail-
ure) can be defined. External criterion variables which would be ap-
propriate for validating high school certification tests are going to
be difficult to gain agreement about and probably very difficult to
measure. For example, how would you go about defining "life success"
and measuring it? Reading experts, for instance, are not going to have
the same idea about what the minimally competent person can read.
Should he/she be able to read at the 12th grade level, or the 8th grade
level? For example, Jaeger (1978) has noted, "Educators would no
sooner agree on the proportion of New York Times front page passages
eleventh-graders should be able to comprehend and explain, than they
would the proportion of multiple-choice test items those eleventh-
graders should answer correctly, so as to be labeled 'minimally
competent.'" Thus, the gist of this reasoning is that if agreement
can't first be reached on the criterion measure, then this isn't going
to aid in setting standards on the test. Given the situation, one
may want to go ahead and try to set the standards on the test without
considering criterion measures. Such a recommendation seems especially
relevant for promotion and high school certification examinations.

One example of a decision-theoretic procedure is due to Kriewall
(1972). This procedure is based upon the definition of (usually) two
mastery states. The standard on the test is then selected as the
point that minimizes "false positive" and "false negative" errors in
the classifying of individuals into the defined mastery states. Once
again, the problem with this method is evident. The mastery categories
would in this case be "competent" and "incompetent," and they are
essentially undefined. Until people can agree on a definition of
"competence" in a given situation, it is not possible to use the method.
You cannot minimize errors of prediction if the categories to be pre-
dicted can't be established. Jaeger (1978) has noted that many of the
methods allow for different utilities to be associated with false
positive and false negative errors, in this case passing the "minimally
incompetent" person or failing the "minimally competent" person. However,
there are no guidelines for establishing these utility values, so
another problem exists with the methods.

Finally, Berk (1976) has presented a method that is very similar
to the decision-theoretic methods just discussed. Rather than setting
the mastery states arbitrarily and observing the probabilities of false
positive and false negative errors on the criterion, Berk suggests the
optimal standard be based on response data from samples of instructed
and uninstructed students. Berk offers a number of procedures to be
used in conjunction with his method. We feel that the procedure holds
great merit for classroom instructional settings, and have devoted a
great deal of time to a discussion of it in our recent review (Hambleton
& Eignor, 1978). The problem involved with using the procedure for
setting standards on minimum competency tests is immediately evident.
There is no simple way of establishing groups of students instructed on
the competencies included in the test and groups which have not had
instruction. Other extreme groups might be formed (for example,
"successful" adults and "unsuccessful" adults) and their performances
compared on the test for the purpose of setting an optimum standard.
Clearly though, results from such comparisons can be explained in
numerous ways, and therefore results of this sort have limited practical
value.
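
The shared logic of the decision-theoretic and two-group approaches can be sketched as a search for the cut-off with the smallest weighted count of misclassifications. This is a simplified illustration, not Kriewall's or Berk's published procedure; the scores, the weights, and the function name `best_cutoff` are assumptions.

```python
def best_cutoff(instructed, uninstructed, fn_weight=1.0, fp_weight=1.0):
    """Return (cutoff, loss) minimizing weighted classification errors.

    An examinee passes when score >= cutoff. A false negative is an
    instructed examinee who fails; a false positive is an uninstructed
    examinee who passes. Ties go to the lowest cutoff.
    """
    candidates = sorted(set(instructed) | set(uninstructed))
    candidates.append(candidates[-1] + 1)  # also allow failing everyone
    best = None
    for c in candidates:
        false_neg = sum(1 for s in instructed if s < c)
        false_pos = sum(1 for s in uninstructed if s >= c)
        loss = fn_weight * false_neg + fp_weight * false_pos
        if best is None or loss < best[1]:
            best = (c, loss)
    return best

# Hypothetical scores on a 10-item test.
instructed = [7, 8, 8, 9, 10, 6]
uninstructed = [3, 4, 5, 6, 7, 5]

print(best_cutoff(instructed, uninstructed))                 # (6, 2.0)
print(best_cutoff(instructed, uninstructed, fp_weight=3.0))  # (8, 2.0)
```

Changing the relative weight of the two error types moves the cut-off from 6 to 8, which shows why the absence of guidelines for choosing utility values is a practical problem.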

Block (1972) introduced a method referred to as "educational
consequences." In this method one looks at the effect the setting of
a standard of proficiency has on future learning or other related cogni-
tive or affective success criteria. Block conducted an experimental
study to consider the effect of different standards on several outcome
measures. The standard for which the valued outcome is maximal (it
could be a combination of valued outcomes) becomes the standard the next
time the test is used.

Glass (1978a) has likened this approach to the general approach of
operations research and the concern for maximizing a valued commodity
by finding an optimum point on a mathematical curve. Glass has pointed
out that, in order to locate a maximum, the curves relating performance
to the valued outcomes must be non-monotonic, which is not likely to be
the case. Glass also talks about the problem of how to weight
individual outcomes to form a composite outcome. There is yet another
problem, perhaps even more serious than the non-monotonicity problem.
One can't maximize a valued outcome if the outcome can't be defined in
any reasonable manner. In sum, to utilize Block's method, there would
have to be consensual agreement on what a valued outcome of being
competent is. This would seem to be as difficult a task as trying to
get people to define behaviors associated with minimum competency.

Finally, Millman (1973) has suggested that standards be adjusted
for the effects of guessing. A systematic error is introduced when the
test item format allows a student to answer items correctly by guessing.
Millman suggests raising the standard to take into account the expected
contribution attributed to pure guessing. Educational Testing Service
has corrected the standards on the NTE exams and the Insurance Licensing
Exams to take care of guessing. The problem here is that for minimum
competency tests, pure random guessing rarely occurs, and because of
this, the effect of raising the standards as if it had is unknown.
Clearly, more work in this area is needed.
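
One common way to express such an adjustment is sketched below. It assumes pure random guessing on every item the examinee does not know, with all options equally likely; this is an illustrative assumption and not necessarily the formula Millman or ETS used.

```python
def adjust_for_guessing(standard, n_options):
    """Raise a proportion-correct standard for blind guessing.

    Assumes an examinee who truly knows a proportion `standard` of the
    items guesses at random among `n_options` choices on the rest.
    """
    return standard + (1.0 - standard) / n_options

# An intended standard of 80% on four-option items.
adjusted = adjust_for_guessing(0.80, 4)
print(round(adjusted, 2))  # 0.85
```

Under these assumptions, an examinee who knows 80% of the material is expected to score 85%, so the observed-score standard would be raised by five points.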

Bayesian methods will not be discussed here. They allow
standard setters to augment the setting of standards with prior infor-
mation and/or group information on the examinees in question. Bayesian
methods also provide a statement of probability concerning an examinee's
true level of competency exceeding the standard. To use the Bayesian
methods, however, a standard must first exist. Any one of the methods
to be discussed next could be used to set the standard.



Judgmental Models 

What follows is a brief discussion of several judgmental methods.
Comments, comparisons and recommendations for use will be offered also.
Table 1 provides a summary of some of the similarities and differences
among the methods.

i. Nedelsky's Method

In Nedelsky's method, judges are asked to view each question in a
test with a particular criterion in mind. The criterion for each question
is: which of the response options should the minimally competent student
(Nedelsky calls them D-F students) be able to eliminate as incorrect? The
minimum passing level (MPL) for that question then becomes the reciprocal
of the number of remaining alternatives. For instance, if on a
5-alternative multiple-choice question a judge feels that a minimally
competent person could eliminate two of the options, then for that
question, MPL = 1/3. The judges proceed with each question in a like
fashion, and upon completion of the judging process, sum the values for
each question to obtain a standard on the total set of test items. Next,
the individual judges' standards are averaged. The average is denoted
$\bar{M}_{FD}$.

Table 1



A Comparison of Several Standard Setting Methods

                                     Judgmental                                           Combination
Question                             Nedelsky  Modified  Angoff  Modified  Ebel   Jaeger  Contrasting   Borderline
                                               Nedelsky          Angoff                   Groups        Group

1. Is a definition of the minimally  Yes       Yes       Yes     Yes       Yes    No      No            Yes
   competent individual necessary?

2. What is the nature of the rating  Items     Items     Items   Items     Items  Items   Individuals   Individuals
   task: items or individuals?

3. Are examinee data needed?         No        No        No      No        No     No      Yes           Yes

4. Do judges have access to the      Yes       Yes       Yes     Yes       Yes    Yes     Usually, but  Usually
   items?                                                                                 don't need to

5. Are the judgments made in a       Both      Both      Both    Both      Both   Both    Individual    Individual
   group setting or individual
   setting?

Choices of methods to use for setting standards on minimum competency tests.

Nedelsky felt that if one were to compute the standard deviation
of individual judges' standards, this distribution would be
synonymous with the (hypothesized or theoretical) distribution of the
scores of the borderline students. This standard deviation, $\sigma$, could
then be multiplied by a constant K, decided upon by the test users, to
regulate how many (as a percent) of the borderline students pass or fail.
The final formula then becomes:

standard = $\bar{M}_{FD} + K\sigma$.

How does the $K\sigma$ term work? Assuming an underlying normal distri-
bution, if one sets K = 1, then 84% of the borderline examinees will fail.
If K = 2, then 98% of these examinees will fail. If K = 0, then 50% of the
examinees on the borderline should fail. The value for K is set by (say)
a committee prior to the examination.
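
The arithmetic of Nedelsky's method can be sketched as follows. The judges' ratings are invented, the function names are hypothetical, and the use of the population standard deviation across judges is an assumption, since the text does not say which form is intended.

```python
from statistics import NormalDist, mean, pstdev

def nedelsky_standard(eliminated, n_options, k=0.0):
    """Standard = mean of the judges' summed MPLs plus k standard deviations.

    eliminated[j][i] is the number of options judge j believes the
    borderline examinee can rule out on item i (fewer than n_options).
    """
    judge_totals = [
        sum(1.0 / (n_options - e) for e in judge)  # MPL = 1 / remaining options
        for judge in eliminated
    ]
    return mean(judge_totals) + k * pstdev(judge_totals)

def percent_failing(k):
    """Percent of borderline examinees below the standard, normal assumption."""
    return 100.0 * NormalDist().cdf(k)

# Two hypothetical judges rating three 5-option items.
ratings = [[2, 3, 1], [2, 2, 2]]
standard = nedelsky_standard(ratings, n_options=5, k=1.0)

print([round(percent_failing(k)) for k in (0, 1, 2)])  # [50, 84, 98]
```

The printed percentages reproduce the values quoted in the text for K = 0, 1, and 2.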

The final result of the application of Nedelsky's method will be
an absolute standard. This is because the standard is arrived at in
a manner independent of the score distributions of any reference
group. In fact, the standard is arrived at prior to application of
the test to the group one is concerned about testing. However, while
the standard can be called absolute, there is a great deal of judgment
involved in applying the method.

ii. Modified Nedelsky

Nassif (1978), in setting standards on the competency-based
teacher education and licensing systems in Georgia, utilized a modified
Nedelsky procedure to set standards. A modification of the Nedelsky
method was needed to handle effectively the volume of items in the pro-
gram. In the modified Nedelsky task, the entire item (rather than
each distractor) is examined and classified in terms of two levels of
examinee competence. The following question was asked about each item:
"Should a person with minimum competence in the teaching field be able
to answer this item correctly?" Possible answers were "yes," "no," and
"I don't know." Agreement among judges can be studied by a simple
comparison of the ratings by judges to each item. A standard may be
obtained by averaging the number of "yes" responses given by judges to
the set of test items.

iii. Ebel's Method

Ebel (1972) goes about arriving at a standard in a
somewhat different manner, but his procedure is also based upon the test
questions rather than an "outside" distribution of scores. Judges are asked
to rate items along two dimensions: relevance and difficulty. Ebel uses four
categories of relevance: essential, important, acceptable and questionable. He
uses three difficulty levels: easy, medium and hard. These categories then form
(in this case) a 3 x 4 grid. The judges are next asked to do two things:

1. Locate each of the test questions in the proper cell, based upon
relevance and difficulty.

2. Assign a percentage to each cell, that percentage being the percentage
of items in the cell that the minimally-qualified examinee should be
able to answer.

Then the number of questions in each cell is multiplied by the appropriate
percentage (agreed upon by the judges), and the sum of all the cells, when
divided by the total number of questions, yields the standard.
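
The cell-by-cell computation just described can be sketched as below. The item counts and percentages are hypothetical, and only three of the twelve possible cells are filled to keep the example short.

```python
def ebel_standard(cells):
    """Percent-correct standard from an Ebel relevance-by-difficulty grid.

    cells maps (relevance, difficulty) to a pair:
    (number of items in the cell, percent the minimally qualified
    examinee should answer correctly).
    """
    total_items = sum(n for n, _ in cells.values())
    expected_correct = sum(n * pct / 100.0 for n, pct in cells.values())
    return 100.0 * expected_correct / total_items

# Hypothetical judgments for a 20-item test.
cells = {
    ("essential", "easy"): (10, 100),
    ("essential", "medium"): (5, 80),
    ("important", "hard"): (5, 40),
}
print(ebel_standard(cells))  # 80.0
```

Here the minimally qualified examinee is expected to answer 10 + 4 + 2 = 16 of the 20 items, giving a standard of 80%.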

Three comments can be made about Ebel's method that should be sufficient
to convince people to be careful in using it. One, Ebel offers no prescription
as to what the number or type of descriptions should be along the two dimen-
sions. This is left up to the judgment of the individuals judging the items.
It could likely be the case that a different set of dimensions applied to the
same test could yield a different standard. Two, the process is based upon
the decisions of judges, and while the standard could be called absolute in
that it is referenced to no other distribution, it can't be called an "objec-
tive" standard. Three, a point about Ebel's method has been offered by
Meskauskas (1976):

In Ebel's method, the judge must simulate the decision
process of the examinee to obtain an accurate judgment
and thus set an appropriate standard. Since the judge
is more knowledgeable than the minimally-qualified
individual, and since he is not forced to make a decision
about each of the alternatives, it seems likely that the
judge would tend to systematically over-simplify the
examinee's task . . . Even if this occurs only occasionally,
it appears likely that, in contrast to the Nedelsky method,
the Ebel method would allow the raters to ignore some of
the finer discriminations that an examinee needs to make
and would result in a standard that is more difficult to
reach. (p. 138)

iv. Angoff's Method

When using Angoff's technique, judges are asked to assign a probability
to each test item directly, thus circumventing the analysis of a grid or the
analysis of response alternatives. Angoff (1971) states:

. . . ask each judge to state the probability that the
'minimally acceptable person' would answer each item
correctly. In effect, the judges would think of a
number of minimally acceptable persons instead of only
one such person, and would estimate the proportion of
minimally acceptable persons who would answer each item
correctly. The sum of these probabilities, or propor-
tions, would then represent the minimally acceptable
score. (p. 515)
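
Angoff's summation can be sketched in a few lines. The ratings are invented, and averaging the judges' sums to reach a single standard is an assumption; the quoted passage specifies only the sum for each judge.

```python
def angoff_standard(ratings):
    """Average, over judges, of each judge's summed item probabilities.

    ratings[j][i] is the probability judge j assigns that a minimally
    acceptable person answers item i correctly.
    """
    per_judge = [sum(judge) for judge in ratings]
    return sum(per_judge) / len(per_judge)

# Two hypothetical judges rating a four-item test.
ratings = [
    [0.5, 0.75, 1.0, 0.25],
    [0.5, 0.5, 1.0, 0.5],
]
print(angoff_standard(ratings))  # 2.5
```

Both judges' probabilities sum to 2.5, so the minimally acceptable score on this four-item test would be 2.5 items correct.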



v. Modified Angoff

ETS (1976) utilized a modification of Angoff's method
for setting standards. Based on the rationale that the task of
assigning probabilities may be overly difficult for the items to be
assessed (National Teacher Exams), Educational Testing Service
instead supplied a seven point scale on which certain percentages were
fixed. The following scale was offered:

5   20   40   60   75   90   95   DNK

where "DNK" stands for "Do Not Know."

ETS has also used scales with the fixed points at somewhat different
values; the scales are consistent, though, in that seven points are given
to choose from. The National Teacher Exam program specified 60 as the
center point since the average percent correct on past exams centered
around 60%. The other options were then spaced on either side of 60.



vi. Jaeger's Method

Jaeger (1978) recently presented a method for standard-setting on the
North Carolina High School Competency Test. Jaeger's method incorporates
a number of suggestions made by participants at a 1976 NCME annual meeting
symposium presented in San Francisco by Stoker, Jaeger, Shepard, Conaway,
and Haladyna; it is iterative, based on judges from a variety of back-
grounds, and employs normative data. Further, rather than asking a
question involving "minimal competence," a term which is hard to opera-
tionalize and conceptualize, Jaeger's questions are instead:

"Should every high school graduate be able to answer
this item correctly?" "Yes, No." and

"If a student does not answer this item correctly,
should he/she be denied a high school diploma?"
"Yes, No."

After a series of iterative processes involving judges from various areas
of expertise, and after the presentation of some normative data,
standards determined by all groups of judges of the same type are
pooled, and a median computed. The minimum median across all
groups is selected as the standard.
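
The final pooling step can be sketched as below. The judge groups and their recommended standards are hypothetical, and the sketch covers only the last step (median within each type of judge, then the minimum across types), not the iterative rating rounds.

```python
from statistics import median

def jaeger_standard(standards_by_group):
    """Return the lowest group median and the per-group medians.

    standards_by_group maps a type of judge to the list of standards
    recommended by the judges of that type.
    """
    medians = {group: median(values) for group, values in standards_by_group.items()}
    return min(medians.values()), medians

# Hypothetical percent-correct standards from three types of judges.
groups = {
    "teachers": [70, 75, 80],
    "parents": [60, 72, 78],
    "employers": [65, 68, 90],
}
standard, medians = jaeger_standard(groups)
print(standard)  # 68
```

Taking the minimum of the group medians means the final standard is one that at least half of the judges in every group would regard as not too high.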

Comparisons Among Judgmental Models 

We are aware of two studies that compare judgmental methods of
setting standards; one study was done in 1976, the other is pre-
sently underway at ETS.

In 1976, Andrew and Hecht carried out an
empirical comparison of the Nedelsky and Ebel methods. In
the study, judges met on two separate occasions to set standards for a
180-item, four-options-per-item exam to certify professional workers.
On one occasion the Nedelsky method was used. On a second occasion the
Ebel method was used. The percentage of test items that should be
answered correctly by the minimally competent examinee was 69% by the
Ebel method and 46% by the Nedelsky method.

Glass (1978a) described the observed difference as a "startling
finding." Our view is that since directions to the judges were
different, and procedures differed, we would not expect the results
from these two methods to be similar. The authors themselves report:

It is perhaps not surprising that two procedures
which involve different approaches to the eval-
uation of test items would result in different
examination standards. Such examination standards
will always be subjective to some extent and will
involve different philosophical assumptions and
varying conceptualizations. (p. )

Ebel (1972) makes a similar point:

. . . it is clear that a variety of approaches can
be used to solve the problem of defining the pass-
ing score. Unfortunately, different approaches
are likely to give different results. (p. )

Possibly the most important result of the Andrew-Hecht study (and
this result was not reported in the Glass paper) was the high level
of agreement in the determination of a standard using the same
method across two teams of judges. The difference was not more than 3.4%
with each method. Data of this kind address a concern raised
by Glass (1978a) about whether judges can make determinations of
standards consistently and reliably. In at least this one study, it
appears that they could. From our interactions with staff at ETS who
conduct teacher workshops for setting standards, we have learned
that teams of teachers working with a common method obtain results that
are quite similar, and this result holds across tests in different
subject matter areas and at different grade levels. We have observed
the same result in our work. Of course, there are conditions which must

Donald Rock at ETS is presently pursuing research on the use of
the Nedelsky and Angoff methods for standard setting on Real Estate
Certification Examinations. The results of this study, which have
not been released, should shed some light on the comparability of
the two judgmental procedures used most frequently to date.

Combination Models 

Two very attractive methods, which we will refer to as combination 
methods, will be considered next. They were first proposed by Zieky 
and Livingston (1977). In these methods, judges are asked to make 
judgments about the mastery levels of students, rather than about test 
items. Teachers would be the most reasonable choice of judges, as the 
judgments to be made concern a student's level of mastery of the area being 
tested. They must identify students as "adequate," "inadequate," 
or "borderline" relative to the content area of interest. The task 
of imagining a minimally competent student or group of students is 
circumvented, and for this reason alone, these methods are in favor. 
What follows is a very brief description of the two methods. Readers 
interested in a more thorough discussion, along with helpful hints 
for applying the methods, should refer to Zieky and Livingston (1977). 

i. Borderline-Group Method 

Once teachers have identified a group of students whose achievement 
is judged to be borderline in the area being tested, the test is 
administered and the median test score for this group becomes an estimate 
of the standard. 
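
As a minimal sketch of this calculation, assuming teacher ratings are recorded as one of the three labels above, the standard is simply the median score of the students rated borderline. The scores, ratings, and function name below are hypothetical.

```python
from statistics import median


def borderline_group_standard(scores, ratings):
    """Median test score of the students whom teachers rated 'borderline'."""
    borderline_scores = [s for s, r in zip(scores, ratings) if r == "borderline"]
    if not borderline_scores:
        raise ValueError("no students were rated borderline")
    return median(borderline_scores)


# Hypothetical test scores and teacher ratings for eight students.
scores = [12, 25, 18, 30, 20, 17, 28, 21]
ratings = [
    "inadequate", "adequate", "borderline", "adequate",
    "borderline", "borderline", "adequate", "borderline",
]

standard = borderline_group_standard(scores, ratings)
print(standard)
```

With only a handful of borderline students the median is unstable, so in practice the estimate is only as good as the size and selection of the borderline group.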

ii. Contrasting-Groups Method 

Once teachers have identified groups of students they are sure 
are definite masters or non-masters of the skills being measured by 
the test, the test is given, and score distributions plotted for each 
group. The intersection of the score distributions becomes the first 
estimate of the standard. This can then be adjusted up or down to 
obtain the required balance between "false-positive" and "false- 
negative" errors. 
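
One simple way to approximate this with raw scores, offered here only as a sketch and not as the procedure Zieky and Livingston prescribe, is to try every candidate cut score and keep the one with the smallest weighted count of misclassifications. With equal weights the chosen cut falls near the point where the two distributions cross; unequal weights move it up or down. The data, the weights, and the function name are assumptions, and integer scores are assumed.

```python
def contrasting_groups_cut(master_scores, nonmaster_scores,
                           fp_weight=1.0, fn_weight=1.0):
    """Pick the integer cut score with the smallest weighted error count.

    A student passes when score >= cut. A false positive is a judged
    non-master who passes; a false negative is a judged master who fails.
    """
    all_scores = list(master_scores) + list(nonmaster_scores)
    best = None
    for cut in range(min(all_scores), max(all_scores) + 2):
        false_pos = sum(s >= cut for s in nonmaster_scores)
        false_neg = sum(s < cut for s in master_scores)
        loss = fp_weight * false_pos + fn_weight * false_neg
        if best is None or loss < best["loss"]:
            best = {"cut": cut, "loss": loss,
                    "false_pos": false_pos, "false_neg": false_neg}
    return best


# Hypothetical scores for students judged definite masters and non-masters.
masters = [22, 25, 27, 24, 30, 19, 26, 28]
nonmasters = [12, 15, 18, 20, 14, 21, 16, 17]

equal = contrasting_groups_cut(masters, nonmasters)
lenient = contrasting_groups_cut(masters, nonmasters, fn_weight=3.0)
print(equal["cut"], lenient["cut"])
```

Weighting false negatives more heavily lowers the cut score in this example, which is the kind of adjustment described above.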

The Contrasting-Groups Method is very similar to a method offered 
independently by Berk (1976). Berk assumes that the students being 
assessed are masters or non-masters on the basis of whether or not 
they have been instructed on the content measured by the test. On 
the other hand, Zieky and Livingston ask teachers to judge the students 
on the skills in the test. The major point to be made is that pro- 
cedures offered by Berk for analysis of the data (a validity coeffi- 
cient, utility analysis) are also applicable with the Contrasting- 
Groups Method. 
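
To illustrate what such a validity coefficient might look like, the sketch below computes a phi coefficient between judged group membership and the pass/fail decision at a given cut score. Treating the validity coefficient as a phi coefficient is our assumption for illustration and is not confirmed by the text above; the data and names are hypothetical.

```python
from math import sqrt


def phi_coefficient(is_master, scores, cut):
    """Phi coefficient between group membership and passing at `cut`."""
    a = b = c = d = 0
    for master, score in zip(is_master, scores):
        passed = score >= cut
        if master and passed:
            a += 1          # master who passes (correct decision)
        elif master:
            b += 1          # master who fails (false negative)
        elif passed:
            c += 1          # non-master who passes (false positive)
        else:
            d += 1          # non-master who fails (correct decision)
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


# Hypothetical data: eight judged masters followed by eight non-masters.
scores = [22, 25, 27, 24, 30, 19, 26, 28, 12, 15, 18, 20, 14, 21, 16, 17]
is_master = [True] * 8 + [False] * 8

phi = phi_coefficient(is_master, scores, cut=22)
print(round(phi, 3))
```

Computing the coefficient at several candidate cut scores gives one way to compare them, with higher values indicating closer agreement between the test decision and the group classification.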

Some Final Remarks 

Our review of the literature identified a variety of methods 
for setting standards. However, when one tries to apply these methods 
to minimum competency tests, problems arise. The empirical 
methods require an external criterion measure which often is very hard 
to obtain. When external criterion measures can be obtained, methods 
proposed by Livingston (1975, 1976), Huynh (1976), Van der Linden and 
Mellenbergh (1976), Hambleton and Novick (1973), Kriewall (1972), and 
Berk (1976) will be very useful. At the present time, the best 
methods for setting standards on elementary and secondary school 
minimum competency tests are those that deal directly with the test. 
These methods do require judgments, and arbitrary standards are 
obtained. Given the state of affairs in the area of standard setting, 
however, we can only suggest that any method be carefully used, and 
that the expressed concerns and recommendations of researchers on 
this topic (for example, Conaway, 1976, 1977; Glass, 1978a, 1978b; 
Haladyna, 1976; Jaeger, 1976; Shepard, 1976) be carefully considered. 

Suggestions for Future Research and Development 

In our paper we have introduced a model for developing and validating 
competency tests and we have considered several methods of setting standards. 
In this final section, several suggestions for future research and devel- 
opment will be offered. The suggestions are organized by the two major 
topics of the paper: 



Competency Test Development and Validation 

1. Technical guidelines are needed for the evaluation of competency 
tests and test manuals. The AERA/APA/NCME Test Standards have some 
value for this purpose but are incomplete, and what relevant material 
there is in the Test Standards is scattered throughout a 75-page 
document. 

2. Usable guidelines for determining test lengths (number of test items/ 
competency) are not available. There are several technical contri- 
butions on the problem in the literature but the contributions are 
rather complex mathematically and therefore not readily usable by 
practitioners. 

3. More needs to be learned about the development and validation of 
performance tests since many of the competencies being discussed 
by designers of competency testing programs can be measured best 
by performance tests. 

4. Considerable attention should be given to the development of guide- 
lines for writing domain specifications. Also their use in devel- 
oping competency tests and in facilitating proper test score inter- 
pretations should be evaluated. Finally, the merits of domain 
specifications in comparison with other approaches for describing 
item pools (for example, algorithmic transformation of sentences from 
written instruction into test items, facet designs and others) 
should be considered. 

5. Latent trait models are being used in the development of some 
norm-referenced tests and in the interpretation of norm-referenced 
test scores. The models appear to have potential also for use with 
competency tests. Equating of scores from one form of a competency 
test to another is one of the more promising applications. Clearly, 
more research on the feasibility of using latent trait models with 
competency tests is called for. 

Standard-Setting Methods 

1. There is a need for considerably more work on both the moral and 
technical issues involved in standard-setting. 

2. There needs to be considerably more study of the term "minimally 
competent," because if the term is better understood, it may be 
possible to link existing standard-setting methods to the intended 
meaning or meanings of the term and thereby greatly facilitate the 
selection of a standard-setting method (or the development of new 
methods). 

3. For "acceptable" standard-setting methods, implementation strategies 
need to be developed, evaluated, and made ready for wide use. At 
present there are few guidelines or procedural steps available for 
applying any of the standard-setting methods. (An exception to this 
is the work by Popham [1978b] and Zieky and Livingston [1977].) 

The purposes of competency testing programs can only be accomplished 
(1) if quality competency tests are constructed and (2) if scores derived 
from the tests are interpreted and used correctly. We hope our paper 
will facilitate the accomplishment of both objectives. 